{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_visual_document_deid.ipynb)\n",
    "\n",
    "\n",
    "## De-Identification\n",
    "\n",
    "Introducing our advanced healthcare deidentification model, effortlessly deployable with a single line of code. This powerful solution integrates state-of-the-art algorithms like ner_deid_subentity_augmented, ContextualParser, RegexMatcher, and TextMatcher, alongside a streamlined Deidentification stage. It efficiently masks sensitive entities such as names, locations, and medical records, ensuring compliance and data security in medical texts. Utilizing OCR capabilities, it also redacts detected information before saving the processed file to the specified location."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Starting the session"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📋 Loading license number 0 from C:\\Users\\gadde/.johnsnowlabs\\licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json\n",
      "👌 Launched \u001B[92mcpu optimized\u001B[39m session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, 🕶Spark-OCR==5.3.2, running on ⚡ PySpark==3.1.2\n"
     ]
    }
   ],
   "source": [
    "from johnsnowlabs import nlp\n",
    "nlp.install(visual=True)\n",
    "nlp.start(visual=True)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-06-24T10:27:47.436477Z",
     "start_time": "2024-06-24T10:27:21.668104700Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning::Spark Session already created, some configs may not take.\n",
      "Warning::Spark Session already created, some configs may not take.\n",
      "pdf_deid_pdf_output download started this may take some time.\n",
      "Approx size to download 1.6 GB\n",
      "[OK!]\n"
     ]
    }
   ],
   "source": [
    "#loading the model\n",
    "model = nlp.load(\"en.image_deid\")"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-06-24T10:28:08.210754700Z",
     "start_time": "2024-06-24T10:27:47.452292500Z"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## PDF De-Identification\n",
    "\n",
    "With the specified input and output paths provided as arguments, the model efficiently processes PDF files, performing de-identification as needed, and seamlessly stores the processed documents at the designated location."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning::Spark Session already created, some configs may not take.\n"
     ]
    }
   ],
   "source": [
    "! wget https://github.com/JohnSnowLabs/nlu/blob/master/tests/datasets/ocr/deid/deid2.pdf\n",
    "! wget https://github.com/JohnSnowLabs/nlu/blob/master/tests/datasets/ocr/deid/download.pdf\n",
    "! wget https://github.com/JohnSnowLabs/nlu/blob/master/tests/datasets/ocr/deid/download_deidentified.pdf\n",
    "! wget https://github.com/JohnSnowLabs/nlu/blob/master/tests/datasets/ocr/deid/deid2_deidentified.pdf\n",
    "\n",
    "#provide the input and the output path\n",
    "input_path,output_path = ['download.pdf',' deid2.pdf'], ['download_deidentified.pdf',' deid2_deidentified.pdf']\n",
    "\n",
    "#predict and save the deidentified pdf's.\n",
    "dfs = model.predict(input_path, output_path=output_path)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-06-24T10:33:43.625036300Z",
     "start_time": "2024-06-24T10:33:40.477056300Z"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
